Towards Exascale Scientific Metadata Management
Abstract
Advances in technology and computing hardware are enabling scientists from all areas of science to produce massive amounts of data using large-scale simulations or observational facilities. In this era of data deluge, effective coordination between the data production and the analysis phases hinges on the availability of metadata that describe the scientific datasets. Existing workflow engines have been capturing a limited form of metadata to provide provenance information about the identity and lineage of the data. However, much of the data produced by simulations, experiments, and analyses still needs to be annotated manually in an ad hoc manner by domain scientists. Systematic and transparent acquisition of rich metadata becomes a crucial prerequisite to sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and domain-agnostic metadata management infrastructure that can meet the demands of extreme-scale science is notable by its absence. To address this gap in scientific data management research and practice, we present our vision for an integrated approach that (1) automatically captures and manipulates information-rich metadata while the data is being produced or analyzed and (2) stores metadata within each dataset to permeate metadata-oblivious processes and to query metadata through established and standardized data access interfaces. We motivate the need for the proposed integrated approach using applications from plasma physics, climate modeling and neuroscience, and then discuss research challenges and possible solutions.
Related resources
Pantheon: Exascale File System Search for Scientific Computing
Modern scientific computing generates petabytes of data in billions of files that must be managed. These files are often organized, by name, in a hierarchical directory tree common to most file systems. As the scale of data has increased, this has proven to be a poor method of file organization. Recent tools have allowed for users to navigate files based on file metadata attributes to provide m...
Towards Exascale Distributed Data Management
Exascale eScience infrastructures will face important and critical challenges both from computational and data perspectives. Increasingly complex and parallel scientific codes will lead to the production of a huge amount of data. The large volume of data and the time needed to locate, access, analyze and visualize data will greatly impact on the scientific productivity of scientists and research...
Metadata Workloads for Testing Big Storage Systems
Efficient namespace metadata management is becoming more important as next-generation file systems are designed for the peta- and exascale era. A number of new metadata management schemes have been proposed. However, evaluation of these designs has been insufficient, mainly due to a lack of appropriate namespace metadata traces. Specifically, no Big Data storage system metadata trace is publicly ...
Towards Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems
Many believe that the state-of-the-art yet decades-old high-performance computing (HPC) storage would not meet the I/O requirements of emerging exascale systems, mainly due to the segregation of compute and storage resources. Indeed, our simulation predicts, quantitatively, that the efficiency and availability would go towards zero as the system scale approaches exascale. This work proposes a new arch...
Storage Support for Data-Intensive Applications on Large Scale High-Performance Computing Systems
Many believe that the state-of-the-art yet decades-old high-performance computing (HPC) storage would not meet the I/O requirements of emerging exascale systems, mainly due to the segregation of compute and storage resources. Indeed, our simulation predicts, quantitatively, that the efficiency and availability would go towards zero as the system scale approaches exascale. This work proposes a new arch...
Journal:
- CoRR
Volume: abs/1503.08482
Publication date: 2015